.. _`K-means Clustering`:

.. _`org.sysess.sympathy.machinelearning.k_means`:

K-means Clustering
==================

.. image:: dataset_blobs.svg
   :width: 48


Clusters data by trying to separate samples into n groups of equal variance.


**Documentation**

Clusters data by trying to separate samples into n groups of equal variance.

*Configuration*:


  - *n_clusters*

    The number of clusters to form as well as the number of
    centroids to generate.


  - *n_init*

    Number of times the k-means algorithm will be run with different
    centroid seeds. The final result will be the best output of
    n_init consecutive runs in terms of inertia.


  - *init*

    Method for initialization:

    'k-means++' : selects initial cluster centers for k-means
    clustering in a smart way to speed up convergence. See section
    Notes in k_init for more details.

    'random' : chooses `n_clusters` observations (rows) at random from the
    data for the initial centroids.

    If an ndarray is passed, it should be of shape (n_clusters, n_features)
    and gives the initial centers.

    If a callable is passed, it should take arguments X, n_clusters and a
    random state and return an initialization.


  - *algorithm*

    K-means algorithm to use. The classical EM-style algorithm is "full".
    The "elkan" variation uses the triangle inequality and is more
    efficient on data with well-defined clusters. However, it is more
    memory intensive because it allocates an extra array of shape
    (n_samples, n_clusters).

    For now "auto" (kept for backward compatibility) chooses "elkan", but
    this might change in the future in favor of a better heuristic.

    .. versionchanged:: 0.18
        Added Elkan algorithm


  - *max_iter*

    Maximum number of iterations of the k-means algorithm for a
    single run.


  - *tol*

    Relative tolerance with regard to the Frobenius norm of the difference
    in the cluster centers of two consecutive iterations to declare
    convergence.


  - *precompute_distances*

    Precompute distances (faster but takes more memory).

    'auto' : do not precompute distances if n_samples * n_clusters > 12
    million. This corresponds to about 100 MB overhead per job using
    double precision.

    True : always precompute distances.

    False : never precompute distances.

    .. deprecated:: 0.23
        ``precompute_distances`` was deprecated in version 0.23 and will be
        removed in 0.25. It has no effect.


  - *n_jobs*

    The number of OpenMP threads to use for the computation. Parallelism is
    sample-wise on the main Cython loop, which assigns each sample to its
    closest center.

    ``None`` or ``-1`` means using all processors.

    .. deprecated:: 0.23
        ``n_jobs`` was deprecated in version 0.23 and will be removed in
        0.25.


  - *random_state*

    Determines random number generation for centroid initialization. Use
    an int to make the randomness deterministic.
    See the scikit-learn glossary entry for ``random_state``.
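
The parameter names above match those of scikit-learn's
``sklearn.cluster.KMeans``, so the following is a minimal sketch of how the
same configuration might look in plain Python, assuming the node wraps that
estimator. The synthetic data, the variable names and the chosen values are
illustrative assumptions, not part of the node. The deprecated
``precompute_distances`` and ``n_jobs`` options and the version-dependent
``algorithm`` option are deliberately left out so that the sketch does not
depend on a particular scikit-learn release.

.. code-block:: python

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic data (assumption): three well-separated 2-D blobs.
    rng = np.random.RandomState(0)
    centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
    X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

    # Ten k-means++ initializations; the run with the lowest inertia is kept.
    model = KMeans(
        n_clusters=3,       # number of clusters and centroids
        n_init=10,          # number of runs with different centroid seeds
        init="k-means++",   # initialization method
        max_iter=300,       # iteration cap for a single run
        tol=1e-4,           # convergence tolerance on center movement
        random_state=0,     # an int makes the initialization reproducible
    )
    model.fit(X)

    # An ndarray of shape (n_clusters, n_features) can be given as init.
    # A single run is enough because the starting centers are fixed.
    seeded = KMeans(n_clusters=3, init=centers, n_init=1, random_state=0)
    seeded.fit(X)

With explicit starting centers, repeated initializations would all begin from
the same point, which is why ``n_init=1`` is used in the second model.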


*Attributes*:


  - *cluster_centers_*

    Coordinates of cluster centers. If the algorithm stops before fully
    converging (see ``tol`` and ``max_iter``), these will not be
    consistent with ``labels_``.


  - *labels_*

    Label of each point, given as the index of its cluster.


  - *inertia_*

    Sum of squared distances of samples to their closest cluster center.
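
As a rough illustration of how the attributes above relate to each other, the
sketch below fits scikit-learn's ``sklearn.cluster.KMeans`` directly (an
assumption about what the node wraps) on four hand-picked points and
recomputes the inertia from ``cluster_centers_`` and ``labels_``. The data and
the name ``manual_inertia`` are hypothetical and exist only for this example.

.. code-block:: python

    import numpy as np
    from sklearn.cluster import KMeans

    # Four points forming two obvious pairs (illustrative data).
    X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    centers = model.cluster_centers_  # shape (n_clusters, n_features)
    labels = model.labels_            # shape (n_samples,), one index per row

    # Inertia by hand: squared distance from each sample to its assigned
    # center, summed over all samples.
    manual_inertia = float(np.sum((X - centers[labels]) ** 2))

    print(manual_inertia, model.inertia_)

The two printed values should agree when the run has converged. If the
algorithm stops early, ``cluster_centers_`` and ``labels_`` may be
inconsistent, and a hand computation like this one can differ from
``inertia_``.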


*Input ports*:


*Output ports*:
    **model** : model
        Model


**Definition**


*Input ports*


*Output ports*

    :model:  model

        Model


.. automodule:: node_clustering

.. class:: KMeansClustering